ABSTRACT

Classical erasure codes, e.g. Reed-Solomon codes, have 
been acknowledged as an efficient alternative to plain repli- 
cation to reduce the storage overhead in reliable distributed 
storage systems. Yet, such codes experience high overhead 
during the maintenance process. In this paper we propose a 
novel erasure-coded framework especially tailored for net- 
worked storage systems. Our approach relies on the use of 
random codes coupled with a clustered placement strategy, 
enabling the maintenance of a failed machine at the gran- 
ularity of multiple files. Our repair protocol leverages net- 
work coding techniques to reduce by half the amount of data 
transferred during maintenance, as several files can be
repaired simultaneously. This approach, as formally proven
and demonstrated by our evaluation on a public experimental
testbed, dramatically decreases the bandwidth overhead
during the maintenance process, as well as the time
to repair a failure. In addition, the implementation is made
as simple as possible, aiming at a deployment into practical 
systems. 

1. INTRODUCTION 

Redundancy is key to providing a reliable service in
practical systems composed of unreliable components.
Typically, distributed storage systems rely heavily on
redundancy to mask ineluctable disk/node unavailabilities and
failures. While three-way replication (often called
triplication) is the standard means to obtain reliability with
redundancy, it is now acknowledged that erasure codes can
dramatically improve the storage efficiency [32]. In other
words, for a given reliability guarantee, the storage overhead
of such codes is reduced by an order of magnitude compared to
replication. Several major cloud systems, such as those of
Microsoft [4] or Google [13], have recently adopted erasure
codes (more specifically Reed-Solomon codes). Facebook was
experimenting with them in 2010 [30]. There is thus a tangible
move of cloud operators from replication to erasure coding,
allowing a more efficient use of scalability-critical
resources.

Reed-Solomon codes are the de facto standard of code-based
redundancy in practice. Yet, those codes have been designed
and optimized to deal with lossy communication channels,
rather than specifically targeting networked storage systems.
In fact, those codes only provide tolerance to transient
failures, the level of redundancy irrevocably decreasing with
host-node failures over time. An additional maintenance
mechanism is thus key to preserving the reliability of stored
information over time, all the more so as storage systems have
grown to a scale where failures have become the norm. However,
Reed-Solomon codes are known to suffer from a significant
overhead in terms of bandwidth utilization and decoding
operations when maintenance has to be triggered. In order to
address these two drawbacks, architectural solutions have been
proposed [26], as well as new code designs [11, 19, 20], paving
the way for better tradeoffs between storage, reliability and
maintenance efficiency. The optimal tradeoff has been very
recently provided by Dimakis et al. [7] with the use of
network coding. However, open issues remain regarding the
feasibility of deploying those new codes in practical
distributed storage systems. Indeed, very few studies evaluate
how hard it is to implement these codes in a production
system [10], as most of them are theoretical. Moreover, those
new codes are examined under the simplifying assumption that
only one file is stored per failed machine, thus ignoring
practical issues when dealing with the maintenance of multiple
files.

Interestingly enough, an appealing alternative for
performance is to use randomness. Randomness can provide a
simple and efficient way to construct codes that are optimal
with high probability (w.h.p.), as Reed-Solomon codes are,
while offering suitable properties in terms of maintenance.
Random codes have been identified as good candidates to
provide fault tolerance in distributed storage systems
[1, 8, 15, 23]. Yet, maintaining such promising codes has not
been considered in practice so far. In this paper we propose a
novel approach to redundancy management, combining both random
codes and network coding, to provide an efficient maintenance
protocol usable in practice. The main intuition behind our
approach is to apply random codes and network coding at the
granularity of the clusters hosting the files, which factorizes
the repair cost across several files at the same time. This
mechanism is made as simple as possible, both in terms of
design and implementation, with the purpose of leveraging the
power of erasure codes while reducing their known drawbacks.



More specifically, our contributions are the following: 

1. We propose a novel maintenance mechanism which
combines a clustered placement strategy, random
codes and network coding techniques at the node level
(i.e., between different files hosted by a single
machine). This approach is called CNC in the sequel, for
Clustered Network Coding. CNC halves the data
transferred during the maintenance process compared
to standard erasure codes. The overhead in terms of
decoding operations is also reduced by an order of
magnitude compared to the repair process of classical
erasure codes. Moreover, CNC enables reintegration
(i.e., the capability to reintegrate nodes which have
been wrongfully declared as failed). Finally, the
network load is evenly balanced between nodes during
the maintenance process, using a simple random
selection. This enables the storage system to scale with
the number of files to repair, as the available bandwidth
is consumed as efficiently as possible. The performance
claims of CNC are formally proven.

2. We deployed CNC on a public execution platform,
namely Grid5000, to evaluate its benefits. In the
typical setup of a datacenter storage system, data
transferred, storage needs and repair time have been
monitored. We compared our solution to both triplication
and Reed-Solomon codes. Experimental results show
that the data transferred for maintenance is reduced
by half compared to Reed-Solomon codes, while consuming
the same storage space and providing the same data
availability. The combination of the data transfer
reduction, the avoidance of decoding operations and a
clever use of the available bandwidth has a strong impact
on the efficiency of the maintenance operation: the time
to repair a failed node is dramatically reduced, thus
enhancing the overall reliability of the system.

The rest of the paper is organized as follows. We first
review the background on maintenance techniques using erasure
codes in Section 2. Our novel approach is presented in
Section 3 and analyzed in Section 4. We then evaluate and
compare it against state-of-the-art approaches in Section 5.
Finally, we present related work in Section 6 and conclude
this paper.

2. MOTIVATION AND BACKGROUND 
2.1 Maintenance in Storage Systems 

Distributed storage systems are designed to provide
reliable storage service over unreliable components
[6, 14, 21, 27]. One of the main challenges of such systems is their
ability to overcome unavoidable component failures p3pT| . 
Fault tolerance usually relies on data redundancy; the
classical triplication is the storage policy adopted by
Hadoop [28] or the Google File System [14] for example. Data
redundancy must be complemented with a maintenance mechanism
able to recover from the loss of data when failures occur, in
order to preserve the reliability guarantees of the system
over time. Maintenance has long been at the very heart of
numerous storage system designs [5, 12, 16, 29]. Similarly,
reintegration, which is the capability to reintegrate replicas
stored on a node wrongfully declared as failed, was shown
in [5] to be one of the key techniques to reduce the
maintenance cost. All these studies focused on the maintenance
of replicas. While plain replication is easy to implement and
easy to maintain, it suffers from a high storage overhead:
typically, x instances of the same file are needed to tolerate
x - 1 simultaneous failures. This high overhead is a growing
concern, especially as the scale of storage systems keeps
increasing. This motivates system designers to consider
erasure codes as an alternative to replication. Yet, using
erasure codes significantly increases the complexity of the
system and challenges designers to devise efficient
maintenance algorithms.

2.2 Erasure Codes in Storage Systems 

Erasure codes have been widely acknowledged as much
more efficient than replication [32] with respect to storage
overhead. More specifically, Maximum Distance Separable
(MDS) codes are optimal: for a given storage overhead (i.e.,
the ratio between the original quantity of data to store and
the quantity of data including redundancy), MDS codes
provide the optimal efficiency in terms of data availability.
Let us now recall the basics of an (n, k) MDS code: a file to
store is split into k chunks, encoded into n blocks with the
property that any subset of k out of n blocks suffices to
reconstruct the file. Thus, to reconstruct a file of M Bytes
one needs to download exactly M Bytes, which corresponds to
the same amount of data as if plain replication were used.
Reed-Solomon codes are a classical example of MDS codes, and
are already deployed in cloud-based storage systems [4, 13].
However, as pointed out in [26], one of the major concerns
with erasure codes lies in the maintenance process, which
incurs a significant overhead in terms of bandwidth
utilization as well as in decoding operations, as explained
below.
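To make the storage argument concrete, the following minimal sketch compares the raw storage consumed by replication with that of an (n, k) MDS code tolerating the same number of simultaneous failures. The function names and the (6, 4) parameters are illustrative assumptions, not values taken from our deployment.

```python
def replication_storage(file_size, failures_tolerated):
    """Storage used when keeping failures_tolerated + 1 full copies."""
    return file_size * (failures_tolerated + 1)


def mds_storage(file_size, n, k):
    """Storage used by an (n, k) MDS code: n blocks of size file_size / k.
    Such a code tolerates n - k simultaneous failures."""
    return file_size * n / k


M = 1024  # file size in MB (illustrative)
# Both schemes below tolerate 2 simultaneous failures.
triplication = replication_storage(M, failures_tolerated=2)
coded = mds_storage(M, n=6, k=4)  # hypothetical (6, 4) code
print(triplication, coded)
```

Under these assumptions the coded scheme stores half as much data as triplication for the same number of tolerated failures.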

Maintenance of Erasure Codes. 

When a node is declared as failed, all blocks of the files it 
was hosting need to be re-created on a new node; we call this 
operation a repair in the sequel. The repair process works as 
follows (see Figure 1): to repair one block of a given file, the
new node first needs to download k blocks of this file (i.e., 
corresponding to the size of the file) to be able to decode it. 
Once decoded, the new node can re-encode the file and then 
regenerate the lost redundant block. This must be iterated 
for all the lost blocks. Three issues arise: 

1. Repairing one block (typically a small part of a file)
requires the new node to download enough blocks (i.e., k)
to reconstruct the entire file, and this must be done for
all the blocks previously stored on the failed node.

2. The new node must then decode the file, though it does
not want to access it. Decoding operations are known
to be time consuming, especially for large files.

3. Reintegrating a node which has been wrongfully
declared as faulty is almost useless. This is due to the
fact that the new blocks created during the repair
operation have to be strictly identical to the lost ones,
as this is necessary to sustain the coding strategy (see
footnote 2). Therefore reintegrating a node results in
having two identical copies of the involved blocks (the
reintegrated ones and the new ones). Such blocks can only
be useful if either the reintegrated node or the new node
fails, but not in the event of any other node failure.

Figure 1: Classical repair process of a single block
(k=2, n=4).

Figure 2: Example of the creation process of encoded
blocks using a random code. Any k=2 blocks are enough
to reconstruct the file X.

In order to mitigate these drawbacks, various solutions
have been suggested. Lazy repair, for instance, as described
in [3], consists in deliberately delaying the repairs,
waiting for several failures to accumulate before repairing
all of them together. This allows multiple failures to be
repaired while incurring the bandwidth (i.e., data
transferred) and decoding overhead only once. However,
delaying repairs leaves the system more vulnerable in case of
a burst of failures. Architectural solutions have also been
proposed, as for example the Hybrid strategy [26]. This
consists in maintaining one full replica stored on a single
node in addition to multiple encoded blocks. This extra
replica is then used when repairs have to be triggered.
However, maintaining an extra replica on a single node
significantly complicates the design, while incurring
scalability issues. Finally, new classes of codes have been
designed [11, 19] which offer a better tradeoff between
storage, reliability and maintenance efficiency.

Footnote 2: This can be achieved either by a tracker
maintaining the global information about all blocks or by the
new node inferring the exact structure of the lost blocks from
all existing ones.

A Case for Random Codes. 

In this paper, we argue that random linear codes (ran- 
dom codes for short) may offer an appealing alternative to 
classical erasure codes in terms of storage efficiency and re- 
liability, while considerably simplifying their maintenance 
process. Random codes were initially evaluated in the context
of distributed storage systems in [1]. The authors showed that
random codes can provide an efficient fault-tolerance
mechanism with the property that no synchronization between
nodes is required. Instead, blocks are generated on each node
independently, in such a way that they fit the coding strategy
with high probability. Avoiding such synchronization is
crucial in distributed settings, as also demonstrated in [15].

The basic principle of encoding a file using random codes
is simple: each file is divided into k chunks and the blocks
stored for reliability are created as random linear
combinations of these k chunks (see Figure 2). All blocks,
along with their associated coefficients, are then stored on
n different nodes. Note that the additional storage space
required for the coefficients is typically negligible
compared to the size of each block.
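The encoding step can be sketched as follows. This is only an illustration: it works over the small prime field GF(257) for readability, whereas a practical system would more likely use a binary extension field, and the names `P` and `encode` are assumptions introduced here.

```python
import random

P = 257  # small prime field, an assumption made for readability


def encode(chunks, n, rng=random):
    """Return n (coefficients, block) pairs, each block being a random
    linear combination (mod P) of the k equally sized chunks."""
    k = len(chunks)
    size = len(chunks[0])
    encoded = []
    for _ in range(n):
        coeffs = [rng.randrange(P) for _ in range(k)]
        block = [sum(c * chunk[i] for c, chunk in zip(coeffs, chunks)) % P
                 for i in range(size)]
        encoded.append((coeffs, block))
    return encoded


# A file split into k = 2 chunks of 4 symbols, encoded into n = 4 blocks.
chunks = [[1, 2, 3, 4], [5, 6, 7, 8]]
blocks = encode(chunks, n=4, rng=random.Random(0))
```

Each stored pair carries k coefficients next to a block of `size` symbols, which is why the coefficient overhead is negligible as soon as blocks are large.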

In order to reconstruct a file initially encoded with a given
k, one needs to download k different blocks of this file. The
theory of random matrices over finite fields ensures that if
one takes k random vectors of the same subspace, these k
vectors are linearly independent with a probability which can
be made arbitrarily close to one, depending on the field
size [1]. This is a key difference with classical erasure
codes, as it avoids any synchronization between nodes. In
other words, an encoded file can be reconstructed as soon as
any set of k encoded blocks is collected and, as already
mentioned, this is optimal (as for MDS codes).
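The dependence on the field size can be illustrated numerically. The sketch below assumes that the k coefficient vectors are drawn uniformly at random from GF(q)^k, in which case the probability that they are linearly independent is the standard product of (1 - q^-i) for i from 1 to k. The function name is ours.

```python
def independence_probability(k, q):
    """Probability that k uniformly random vectors of GF(q)^k are
    linearly independent: prod_{i=1..k} (1 - q**-i)."""
    prob = 1.0
    for i in range(1, k + 1):
        prob *= 1.0 - q ** (-i)
    return prob


# The probability approaches one as the field grows (k = 16 here).
for q in (2, 2 ** 8, 2 ** 16):
    print(q, independence_probability(16, q))
```

With a binary field the probability stays far from one, which is why random codes are used over larger fields.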

3. CLUSTERED NETWORK CODING 

Our CNC system is designed to sustain a predefined level
of reliability, i.e., of data redundancy, by recovering from
failures with a limited impact on performance. We assume
that failure detection is performed by a monitoring system,
the description of which is out of the scope of this paper.
We also assume that this system triggers the repair process,
assigning new nodes to replace the faulty ones; these new
nodes are in charge of recovering the lost data and storing
it.

The predefined reliability level is set by the storage sys- 
tem operator. This reliability level then directly translates 
into the redundancy factor to be applied to files to be stored, 
with parameters k (number of blocks sufficient to retrieve a 
file) and n (total number of redundant blocks for a file). A 
typical scenario for using CNC is a storage cluster like in 
the Google File System [14], where files are streamed into
extents of the same size, for example 1GB as in Windows 
Azure Storage [4]. These extents are erasure coded in order 
to save storage space. 

3.1 A Cluster-based Approach 

To provide an efficient maintenance, CNC relies on (i)
hosting all blocks related to a set of files on a single
cluster of nodes, and (ii) repairing multiple files
simultaneously. This is achieved by combining the use of
random codes, network coding and a cluster-based placement
strategy. This allows several files to be repaired
simultaneously, without requiring computationally intensive
decoding operations, thus factorizing the costs of repair
across the multiple files stored by the faulty node. To this
end, the system is partitioned into disjoint clusters of n
nodes, so that each node of the storage system belongs to one
and only one cluster. Each file to be stored is encoded using
random codes and is associated with a single cluster. All
blocks of a given file are then stored on the n nodes of the
same cluster. In other words, the CNC placement strategy
consists in storing blocks of two different files belonging to
the same cluster on the same set of nodes, as illustrated in
Figure 3.

In such a setup, the storage system manager (e.g., the
master node in the Google File System [14]) only needs to
maintain two data structures: an index which maps each file to
one cluster, and a per-cluster index which contains the set of
identifiers of the nodes in this cluster. This simple data
placement scheme leads to significant data transfer gains and
better load balancing, by clustering operations on encoded
blocks, as explained in the remaining part of this section.

Figure 3: Clustered placement for an n = 3 redundant
system.
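The two indexes kept by the storage system manager can be sketched as a pair of dictionaries. The class and method names below are hypothetical and only illustrate the placement rule; they do not describe our actual tracker.

```python
class ClusterDirectory:
    """Hypothetical manager-side metadata for clustered placement."""

    def __init__(self):
        self.file_to_cluster = {}   # file id -> cluster id
        self.cluster_to_nodes = {}  # cluster id -> list of node ids

    def add_cluster(self, cluster_id, node_ids):
        self.cluster_to_nodes[cluster_id] = list(node_ids)

    def place_file(self, file_id, cluster_id):
        """Associate a file with one cluster; one block goes to each node."""
        self.file_to_cluster[file_id] = cluster_id
        return self.cluster_to_nodes[cluster_id]

    def nodes_of(self, file_id):
        return self.cluster_to_nodes[self.file_to_cluster[file_id]]

    def files_on_node(self, node_id):
        """Files that lose a block if node_id fails."""
        return [f for f, c in self.file_to_cluster.items()
                if node_id in self.cluster_to_nodes[c]]


directory = ClusterDirectory()
directory.add_cluster("c1", ["n1", "n2", "n3"])
directory.add_cluster("c2", ["n4", "n5", "n6"])
directory.place_file("X", "c1")
directory.place_file("Y", "c1")
directory.place_file("Z", "c2")
```

Because every node of a cluster stores one block of each file of that cluster, `files_on_node` returns the same list for all nodes of a cluster, which is the property the repair protocol relies on.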

3.2 Maintenance of CNC 

When a node failure is declared, the maintenance opera- 
tion must ensure that all the blocks hosted on the faulty node 
are repaired in order to preserve the redundancy factor and 
hence the predefined reliability level of the system. Repair 
is usually performed at the granularity of a file. Yet, a node
failure typically leads to the loss of several blocks,
involving several files. It is precisely this characteristic
that CNC leverages. Typically, when a node fails, multiple
repairs are triggered, one for each block of each file that
the failed node was storing. Traditional approaches using
erasure codes actually treat a failed node as the failure of
all of its blocks. Instead, the novelty of CNC is to leverage
network coding at the node level, i.e., between different
files on a particular cluster. This is possible since the CNC
placement strategy clusters files so that all nodes of a
cluster store the same files. This technical shift
significantly reduces the data to be transferred during the
maintenance process.

3.3 An Illustrating Example 

To provide the intuition of CNC, and before generalizing
in the next section, we now describe a simple example (see
Figure 4) involving two files and a 4-node cluster. We
consider two files X and Y of size M = 1024 MB, encoded
with random codes (k = 2, n = 4), stored on the 4 nodes of
the same cluster (i.e., Nodes 1 to 4). File X is split into
k = 2 chunks X1 and X2, and file Y into chunks Y1 and Y2.
Each node stores two encoded blocks, one related to file X
and the other to file Y, which are respectively a random
linear combination of {X1, X2} and {Y1, Y2}. Each block
has a size of M/k = 512 MB, thus each node stores a total of
2 x 512 = 1024 MB. We now consider the failure of Node 4.

In a classical repair process, the new node asks k = 2
nodes for their blocks corresponding to files X and Y and thus
downloads 4 blocks, for a total of 4 x 512 = 2048 MB. This
enables the new node to decode the two files independently,
and then re-encode each file to regenerate one block for file
X and one for file Y and store them.

Figure 4: Example of a CNC repair process, for the repair of
a new node in a cluster of 4 (with k = 2, n = 4).

Instead, CNC leverages the fact that the encoded blocks 
related to files X and Y are stored on the same node and 
restored on the same new node to encode the files together 
rather than independently during the repair process. More 
precisely, if the nodes are able to compute a linear combi- 
nation of their encoded blocks, we can prove that if k = 2, 
only 3 blocks are sufficient to perform the repair of the two 
files X and Y. Thus the transfer of only 3 blocks incurs the 
download of 3 x 512 = 1536 MB, instead of the 2048 MB 
needed with the classical repair process. In addition, this re- 
pair can be processed without decoding any of the two files. 
In practice, the new node has to contact the three remaining 
nodes to perform the repair. Each of the three nodes sends 
the new node a random linear combination of its two blocks 
with the associated coefficients. Note that the two files are 
now mixed, i.e. encoded together. However, we want to be 
able to access each file independently after the repair. The 
challenge is thus to create two new random blocks, with 
the restrictions that one is only a random linear combina- 
tion of the X blocks, and the other of the Y blocks. In this 
example, finding the appropriate coefficients in order to
cancel the Xi or Yi comes down to solving two independent
systems of two equations with three unknowns, as shown in
Figure 4. The intuition is that, as the coefficients of these
equations are random, these two systems are solvable w.h.p.
The new node then makes two different linear combinations
of the three received blocks according to the previously
computed coefficients, (A, B, C) and (D, E, G) in the example.
It thereby creates two new independent random blocks, one
related to file X and one to file Y. The repair is then
complete, having saved the bandwidth consumed by the transfer
of one block, i.e., 512 MB in this example.
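The example can be replayed in a few lines of code. The sketch below is restricted to k = 2 and works over the small prime field GF(257), both assumptions made for readability. Each of the three remaining nodes sends one random combination of its X and Y blocks, and the new node obtains the two triples of coefficients as cross products, which is one simple way to solve two equations with three unknowns. All function names are ours.

```python
import random

P = 257  # small prime field, assumed here for readability


def combine(scalars, vectors):
    """Linear combination of equal-length vectors modulo P."""
    return [sum(s * v[i] for s, v in zip(scalars, vectors)) % P
            for i in range(len(vectors[0]))]


def cross(r1, r2):
    """Vector orthogonal (mod P) to both 3-dimensional rows r1 and r2."""
    return [(r1[1] * r2[2] - r1[2] * r2[1]) % P,
            (r1[2] * r2[0] - r1[0] * r2[2]) % P,
            (r1[0] * r2[1] - r1[1] * r2[0]) % P]


def repair_block(node, rng):
    """Random combination of the X block and the Y block of one node.
    Coefficients are expressed over (X1, X2, Y1, Y2)."""
    u, v = rng.randrange(1, P), rng.randrange(1, P)
    (cx, bx), (cy, by) = node
    coeffs = [u * cx[0] % P, u * cx[1] % P, v * cy[0] % P, v * cy[1] % P]
    return coeffs, combine([u, v], [bx, by])


def cnc_repair(received):
    """Rebuild one X block and one Y block from 3 received blocks (k = 2)."""
    coeffs = [c for c, _ in received]
    data = [d for _, d in received]

    def column(j):
        return [c[j] for c in coeffs]

    abc = cross(column(2), column(3))  # cancels Y1 and Y2
    deg = cross(column(0), column(1))  # cancels X1 and X2
    new_x = (combine(abc, coeffs)[:2], combine(abc, data))
    new_y = (combine(deg, coeffs)[2:], combine(deg, data))
    return new_x, new_y


rng = random.Random(1)
X = [[1, 2, 3], [4, 5, 6]]     # chunks X1, X2
Y = [[7, 8, 9], [10, 11, 12]]  # chunks Y1, Y2


def stored_block(chunks):
    c = [rng.randrange(1, P), rng.randrange(1, P)]
    return c, combine(c, chunks)


nodes = [(stored_block(X), stored_block(Y)) for _ in range(3)]  # Nodes 1 to 3
new_x, new_y = cnc_repair([repair_block(node, rng) for node in nodes])
```

The new node downloads three blocks instead of four and never reconstructs X or Y. The returned pairs have the same form as any stored block, namely k coefficients and the corresponding combination of the chunks of a single file.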

3.4 CNC: The General Case 

We now generalize the previous example for any k. We 
first define a RepairBlock object: a RepairBlock is a random 
linear combination of two encoded blocks of two different 
files stored on a given node. RepairBlocks are transient
objects which only exist during the maintenance process, i.e.,
RepairBlocks are never stored permanently.

We are now able to formulate the core technical result of 
this paper; the following proposition applies in a context 
where different files are encoded using random codes with 
the same k, and the encoded blocks are placed according to 
the cluster placement described in the previous section. 

PROPOSITION 1. In order to repair two different files,
downloading k + 1 RepairBlocks from k + 1 different nodes
is a sufficient condition.

Repairing two files jointly actually comes down to creating
one new random block for each of the two files; the formal
proof of this proposition is given in the Appendix. This
proposition implies that instead of having to download 2k
blocks as with Reed-Solomon codes when repairing, CNC
decreases that need to only k + 1. Other implications and
analysis are detailed in the next section.

Figure 5: One iteration of the repair process, at the end
of which two encoded blocks are repaired.

Figure 6: Necessary amount of data to transfer to repair a
failed node, according to the selected redundancy scheme
(files of 1 GB each).

Note that the encoded blocks of the two files do not need to
have the same size. In case of different sizes, the smallest
is simply zero-padded during the network coding operations, as
usually done in this context; the padding is then removed at
the end of the repair process.

Figure 5 describes one iteration of the process, at the end
of which two encoded blocks are repaired. Each of the k + 1
nodes sends a RepairBlock to the new node, which then
combines them to restore the two lost encoded blocks.
However, nodes usually store far more than two blocks,
implying multiple iterations of the process described in
Figure 5. More formally, to restore a failed node which was
storing x blocks, the repair process must be iterated x/2
times. In fact, as two new blocks are repaired during each
iteration, the number of iterations is halved compared to the
classical repair process. Note that in case of an odd number
of blocks stored, the repair process is iterated until only
one block remains. The last block is repaired by downloading
k blocks of the corresponding file, which are then randomly
combined to conclude the repair. The overhead related to the
repair of the last block in case of an odd block number
vanishes with a growing number of blocks stored.

The fact that the repair process must be iterated several 
times can also be leveraged to balance the bandwidth load 
over all the nodes in the cluster. Only k + 1 nodes over 
the n of the cluster are selected at each iteration of the repair 
process; as all nodes of the cluster have a symmetrical role, a 
different set of k + 1 nodes can be selected at each iteration. 
In order to leverage the whole available bandwidth of the 
cluster, CNC makes use of a random selection of these k + 1 
nodes at each iteration. In other words, for each round of the
repair process, the new node selects k + 1 nodes over the n
cluster nodes randomly. Doing so, we show that every node
is evenly loaded, i.e., each node sends the same number of
RepairBlocks in expectation.

More formally, let N be the number of RepairBlocks sent
by a given node. In a cluster where n nodes participate in the
maintenance operation, for T iterations of the repair process,
the average number of RepairBlocks sent by each node is:

\[ E(N) = T \, \frac{k+1}{n} \qquad (1) \]

The proof is given in the Appendix. An example illustrating
this proposition is provided in the next section.
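The expected load E(N) = T (k + 1) / n can also be checked with a short simulation of the random selection. This is a sketch under the assumption that the k + 1 nodes are drawn uniformly without replacement at each iteration; the function name and parameters are illustrative.

```python
import random
from collections import Counter


def simulate_load(n, k, iterations, rng):
    """Count the RepairBlocks sent by each of the n nodes when k + 1
    distinct nodes are drawn uniformly at random at every iteration."""
    load = Counter({node: 0 for node in range(n)})
    for _ in range(iterations):
        for node in rng.sample(range(n), k + 1):
            load[node] += 1
    return load


T, n, k = 5, 4, 2
load = simulate_load(n=n, k=k, iterations=T, rng=random.Random(0))
expected = T * (k + 1) / n  # Equation (1)
print(dict(load), expected)
```

Whatever the draw, the loads always sum to T (k + 1), and the per-node counts concentrate around the expected value as T grows.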

4. CNC ANALYSIS 

The novel maintenance protocol proposed in the previous
section makes it possible (i) to significantly reduce the
amount of data transferred during the repair process; (ii) to
balance the load between the nodes of a cluster; (iii) to
avoid computationally intensive decoding operations; and
finally (iv) to provide useful node reintegration. These
benefits are detailed below.

4.1 Transfer Savings 

A direct implication of Proposition 1 is that for large
enough values of k, the data to transfer required to perform a
repair is halved; this directly results in a better usage of
available bandwidth within the datacenter. To repair two files
in a classical repair process, the new node needs to download
at least 2k blocks to be able to decode each of the two files.
Then the ratio (k + 1)/(2k) (CNC over Reed-Solomon) tends
to 1/2 as larger values of k are used.

The exact amount of data a(x, k, s) necessary to repair x
blocks of size s encoded with the same k is given as follows:

\[
a(x,k,s) =
\begin{cases}
s \, \frac{x}{2} \, (k+1) & \text{if } x \text{ is even} \\
s \left( \frac{x-1}{2} \, (k+1) + k \right) & \text{if } x \text{ is odd}
\end{cases}
\]

An example of the transfer savings is given in Figure 6
for k = 16 and a file size of 1 GB.
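The amount of data a(x, k, s) follows directly from the repair procedure: each pair of lost blocks costs k + 1 RepairBlocks of size s, and a leftover block, if any, costs k blocks. A minimal sketch, with function names of our choosing, compares it with the classical repair, which downloads k blocks per lost block.

```python
def cnc_repair_traffic(x, k, s):
    """Data downloaded by CNC to repair x blocks of size s (same k)."""
    pairs, leftover = divmod(x, 2)
    return s * (pairs * (k + 1) + leftover * k)


def classical_repair_traffic(x, k, s):
    """Data downloaded by the classical repair: k blocks per lost block."""
    return s * x * k


# k = 16 and files of 1 GB, hence blocks of s = 1024 / 16 = 64 MB.
for x in (2, 10, 11):
    print(x, cnc_repair_traffic(x, 16, 64), classical_repair_traffic(x, 16, 64))
```

With k = 2 and s = 512 MB, the same functions give back the 1536 MB and 2048 MB of the illustrating example of Section 3.3.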
Figure 7: Natural load balancing for blocks queried
when repairing a failed node (node 5), for 10 blocks to
restore.



We described in CNC, through Proposition 1, the need to
repair lost files by groups of two. One can wonder whether
there is a benefit in grouping more than two files during the
repair. In fact, a simple extension of Proposition 1 is that to
group G files together, a sufficient condition is that the
newcomer downloads (G - 1)k + 1 RepairBlocks from
(G - 1)k + 1 distinct nodes over the n ones in the cluster.
Firstly, this implies that the new node must be able to
contact many more nodes than k + 1. Secondly, we can easily
see that the gains made possible by CNC are maximal when
grouping the files by two: savings in data transfer when
repairing are expressed by the ratio ((G - 1)k + 1)/(Gk).
The minimal value of this ratio (which is equivalent to the
maximal gain) is obtained for G = 2 and large values of k.
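The claim that G = 2 is the best grouping can be made explicit. Assuming that the classical process downloads Gk blocks to repair G files, while the grouped repair downloads (G - 1)k + 1 RepairBlocks, the ratio can be rewritten as follows:

```latex
\[
\frac{(G-1)k+1}{Gk}
  \;=\; 1 - \frac{1}{G} + \frac{1}{Gk}
  \;=\; 1 - \frac{1}{G}\left(1 - \frac{1}{k}\right)
\]
```

For any k > 1 the term (1 - 1/k) is positive, so the ratio increases with G and is smallest at G = 2, where it equals (k + 1)/(2k) and tends to 1/2 as k grows.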

A second natural question is whether or not downloading
fewer than (G - 1)k + 1 RepairBlocks to group G files
together is possible. We can positively answer this question,
as the value (G - 1)k + 1 is only a sufficient condition. In
fact, if nodes do not send random combinations, but carefully
choose the coefficients of the combination, it is theoretically
possible to download fewer RepairBlocks. However, as G grows,
finding the "adequate" coefficients becomes computationally
intractable, especially for large values of k. These
coefficients can be found in some cases using interference
alignment techniques (see for example [9]). However, details
of these techniques are outside the scope of this paper, as no
efficient algorithm is known to solve this problem to date.
This then calls for the use of the simpler operation, i.e.,
G = 2, as we have presented in this paper.



4.2 Load Balancing 

As previously mentioned, when a node fails, the repair
process is iterated as many times as needed to repair all lost
blocks. CNC ensures that the load over remaining nodes is
balanced during maintenance; Figure 7 illustrates this. This
example involves a 5-node cluster, storing 10 different files
encoded with random codes (k = 2). Node 5 has failed,
involving the loss of 10 blocks of the 10 files stored on that
cluster. Nodes 1 to 4 are available for the repair process.

CNC provides a load-balanced approach, inherent to the
random selection of the k + 1 = 3 nodes at each round.
In addition, only T = 5 iterations of the repair process are
necessary to recreate the 10 new blocks, as each iteration
repairs 2 blocks at the same time. The total number of
RepairBlocks sent during the whole maintenance is
T x (k + 1) = 15, whereas the classical repair process needs
to download 20 encoded blocks. The random selection
ensures in addition that the load is evenly balanced between
the available nodes of the cluster. Here, nodes 1, 2 and 4 are
selected during the first repair round, then nodes 2, 3 and 4
during the second round, and so forth. The total number of
RepairBlocks is balanced between all available nodes, each
sending T x (k + 1)/n = 15/4 = 3.75 RepairBlocks on average.
As a consequence of using the whole available bandwidth
in parallel, contrary to sequentially fetching blocks from only
a subset of nodes, the Time To Repair (TTR) a failed node
is also greatly reduced. This is confirmed experimentally in
Section 5.

4.3 No Decoding Operations 

Decoding operations are known to be time consuming and
should therefore only be necessary in case of file accesses.
While the use of classical erasure codes requires such
decoding to take place upon repair, CNC avoids those
cost-intensive operations. In fact, no file needs to be decoded
at any time in CNC: repairing two blocks only requires
computing two linear combinations instead of decoding the two
files. Yet the output of our repair process is strictly
equivalent to what would be obtained if the files had been
decoded. This greatly simplifies the repair process compared
to classical approaches. As a consequence, the time to perform
a repair is reduced by an order of magnitude compared to the
classical repair process, especially when dealing with large
files, as confirmed by our experiments (Section 5).

4.4 Reintegration 

The decision to declare a node as failed is usually
made using timeouts; this is typically a decision prone to
errors [5]. In fact, nodes can be wrongfully timed-out and
can reconnect once the repair is done. While the longer the
timeouts, the fewer errors are made, adopting large timeouts
may jeopardize the reliability guarantees, typically in
the event of a burst of failures. The interest of
reintegration is to be able to leverage the fact that nodes
which have been wrongfully timed-out are reintegrated in the
system. While this idea has already been explored using
replication [5], reintegration has not been addressed when
using erasure codes.
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Figure 8: System overview 



As previously mentioned, when using classical erasure
codes, the repaired blocks have to be strictly identical to the
lost ones. Therefore reintegrating a failed node in the system
is almost useless, as this results in two identical copies of the
lost and repaired blocks. Such blocks can only be useful in
the event of the failure of two specific nodes, the wrongfully
timed-out node and the new node.

On the contrary, reintegration is always useful when deploying
CNC. More precisely, every single new block can be leveraged
to compensate for the loss of any other block and is therefore
useful in the event of the failure of any node. Indeed, newly
created blocks are simply new random blocks, thus different
from the lost ones while being functionally equivalent.
Therefore each new block contributes to the redundancy factor
of the cluster. Assume that a node which has been wrongfully
declared as failed returns into the system. A repair has been
performed to sustain the redundancy factor while it turned out
not to be necessary. This only means that the system is now one
repair process ahead and can leverage this unnecessary repair to
avoid triggering a new instance of the repair protocol when the
next failure occurs.

5. EVALUATION 

In order to confirm the theoretical savings provided by the 
CNC repair protocol, in terms of bandwidth utilization and 
decoding operations, we deployed CNC over a public exper- 
imental platform. We describe hereafter the implementation 
of the system and CNC experimental results. 

5.1 System Overview 

We implemented a simple storage cluster with an architecture
similar to Hadoop [28] or the Google File System [14].
This architecture is composed of one tracker node that manages
the metadata of files, and several storage nodes that
store the data. This set of storage nodes forms a cluster as
defined in Section 3. The overview of the system architecture
is depicted in Figure 8. Client nodes can PUT/GET the
data directly to/from the storage nodes, after having obtained
their IP addresses from the tracker. In case of a storage node
failure, the tracker initiates the repair process and schedules
the repair jobs.

All files to be stored in the system are encoded using random
codes with the same k. Let n be the number of storage
nodes in the cluster, then n encoded blocks are created for
each file, one for each storage node. Recall that the system
can thus tolerate n - k storage node failures before files are
lost for good.

PUT/GET and Maintenance Operations. 

In the case of a PUT operation, the client first encodes
the file into n blocks. The coefficients of the linear combination
associated with each encoded block are prepended to
the block. Those n encoded blocks are sent to the
n storage nodes of the cluster using a PUT_BLOCK_MSG. A
PUT_BLOCK_MSG contains the encoded information, as well
as the hash of the corresponding file. Upon receipt of a
PUT_BLOCK_MSG, the storage node stores the encoded block
using the hash as filename.

To retrieve the file, the client sends a GET_BLOCK_MSG
to at least k out of the n nodes of the cluster. A
GET_BLOCK_MSG only contains the hash of the file to be
retrieved. Upon receipt of a GET_BLOCK_MSG, the storage node
sends the block corresponding to the given hash. As soon as
the client has received k blocks, the file can be recovered.
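
The PUT and GET paths above can be illustrated with a minimal sketch of random linear encoding and decoding. It is not the authors' Java library: for brevity it works over the prime field of integers modulo 65537 instead of the field with $2^{16}$ elements, and the names `encode`, `decode` and `P` are hypothetical. Each encoded block carries its k coefficients in front of the payload, and any k encoded blocks are decoded by Gauss-Jordan elimination (which fails only in the unlikely event that the fetched blocks are linearly dependent).

```python
import random

P = 65537  # prime modulus; a stand-in for the paper's field with 2^16 elements (assumption)

def encode(native, n, rng):
    """Return n encoded blocks; each is k coefficients followed by the combined payload."""
    k, length = len(native), len(native[0])
    blocks = []
    for _ in range(n):
        coeffs = [rng.randrange(P) for _ in range(k)]
        payload = [sum(c * blk[j] for c, blk in zip(coeffs, native)) % P
                   for j in range(length)]
        blocks.append(coeffs + payload)  # coefficients are prepended to the block
    return blocks

def decode(encoded, k):
    """Recover the k native blocks from k encoded blocks (Gauss-Jordan elimination mod P)."""
    rows = [list(r) for r in encoded[:k]]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("blocks are linearly dependent; fetch another block")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
    # The coefficient part is now the identity, so the payloads are the native blocks.
    return [row[k:] for row in rows]

native = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # k = 4 native blocks
blocks = encode(native, n=8, rng=random.Random(1))        # one block per storage node
recovered = decode(blocks[2:6], k=4)                      # any k blocks suffice
```

In this sketch a client would send `blocks[i]` to storage node i on PUT and call `decode` as soon as k blocks have been returned on GET.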

In case of a storage node failure, a new node is selected
by the tracker to replace the failed one. This new node
sends an ASK_REPAIRBLOCK_MSG to k + 1 storage nodes.
An ASK_REPAIRBLOCK_MSG contains the two hashes of the
two blocks which have to be combined following the repair
protocol described in Section 3. Upon receipt of an
ASK_REPAIRBLOCK_MSG, the storage node combines the two
encoded blocks corresponding to the two hashes, and sends
the resulting block back to the new node. As soon as k + 1
blocks are received, the new node can regenerate two lost
blocks. This process is iterated until all lost blocks are
repaired.
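
One way to picture how k + 1 RepairBlocks yield two new blocks without any decoding is the following sketch. It rests on several assumptions that are not taken from the paper: arithmetic is done modulo the prime 65537 rather than in the field with $2^{16}$ elements, every block carries 2k coefficients (k for file X, k for file Y) in front of its payload, and the helper names `combine` and `cancel` are hypothetical. The new node eliminates the X coefficients among the k + 1 received blocks to obtain a fresh block of Y, and symmetrically for X.

```python
import random

P = 65537  # prime modulus; a stand-in for the paper's field with 2^16 elements (assumption)

def combine(coeffs, blocks):
    """Element-wise linear combination sum_i coeffs[i] * blocks[i] mod P."""
    return [sum(c * b[j] for c, b in zip(coeffs, blocks)) % P
            for j in range(len(blocks[0]))]

def cancel(rows, cols):
    """Return a linear combination of `rows` that is zero in every column of `cols`.

    Forward elimination on `cols`: at most len(cols) rows become pivots, so with
    len(rows) > len(cols) the row left in last position is zero on all of `cols`.
    """
    rows = [list(r) for r in rows]
    used = 0
    for c in cols:
        pivot = next((r for r in range(used, len(rows)) if rows[r][c]), None)
        if pivot is None:
            continue
        rows[used], rows[pivot] = rows[pivot], rows[used]
        inv = pow(rows[used][c], -1, P)
        for r in range(used + 1, len(rows)):
            f = rows[r][c] * inv % P
            if f:
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[used])]
        used += 1
    return rows[-1]

rng = random.Random(0)
k, length = 3, 4
X = [[rng.randrange(P) for _ in range(length)] for _ in range(k)]  # native blocks of file X
Y = [[rng.randrange(P) for _ in range(length)] for _ in range(k)]  # native blocks of file Y

repair_blocks = []
for _ in range(k + 1):  # each of the k + 1 contacted nodes answers one request
    ax = [rng.randrange(P) for _ in range(k)]
    ay = [rng.randrange(P) for _ in range(k)]
    block_x = ax + [0] * k + combine(ax, X)   # stored encoded block of X
    block_y = [0] * k + ay + combine(ay, Y)   # stored encoded block of Y
    deltas = [rng.randrange(1, P), rng.randrange(1, P)]
    repair_blocks.append(combine(deltas, [block_x, block_y]))  # RepairBlock

new_y = cancel(repair_blocks, range(k))         # X part eliminated: new block of Y
new_x = cancel(repair_blocks, range(k, 2 * k))  # Y part eliminated: new block of X
```

Only k + 1 blocks are transferred to regenerate two blocks, and the new node performs row operations on the received blocks only; neither X nor Y is ever decoded.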

5.2 Deployment and Results 

We deployed the system previously described on the
Grid5000 execution platform. The experiment ran on 33
nodes connected with a 1GB network. Each node has 2 Intel
Xeon L5420 CPUs at 2.5 GHz, 32GB RAM and a 320GB hard
drive. We randomly chose 32 storage nodes to form a cluster,
as defined in Section 3. The last remaining node was elected
as the tracker. All files were encoded with k = 16, and we
assumed that the size of each inserted file is 1GB. This size
is used in Windows Azure Storage for sealed extents which
are erasure coded [4].

Scenario. 

In order to evaluate our maintenance protocol, we implemented
a first phase in which i files are inserted in the cluster, and
artificially triggered a repair during the second phase. According
to the protocol previously described, the tracker selects
a new node to replace the faulty node, to which it sends
the list of IP addresses of the storage nodes. The new node
then directly requests RepairBlocks from storage nodes, without
any intervention of the tracker, until it recovers as many
encoded blocks as the failed node was storing. We measured
the time to repair a failed node depending on the number of
blocks it was hosting. The time to repair is defined as the
time between the reception of the list of IPs and the time all
new encoded blocks are effectively stored on the new node.
We compared CNC against a classical maintenance mechanism
(called RS), which would be used with Reed-Solomon
codes as described in Section 2.2, and against standard
replication. All the presented results are averaged over three
independent experiments. This small number of experiments
is explained by the fact that Grid5000 allows reserving a
whole cluster of nodes in isolation, which ensures that
experiments are highly reproducible; we observed a
standard deviation under 2 seconds for all values.

Figure 9: Encoding time depending on file size when using
random codes.

Coding. 

We developed a Java library to deal with arithmetic operations
over a finite field¹. In this experiment, arithmetic operations
are performed over a finite field with $2^{16}$ elements,
as it allows data to be treated as a stream of unsigned short
integers (16 bits). Additions and subtractions correspond to
XOR operations between two elements. Multiplications and
divisions are performed in log space using lookup tables
which are computed offline. This library enabled us to implement
classical matrix operations over finite fields, such as
linear combinations, encoding and decoding of files.
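
The table-based arithmetic described above can be sketched as follows. This is an illustration in Python, not the authors' Java library; the reduction polynomial $x^{16} + x^{12} + x^3 + x + 1$ (0x1100B) is an assumed choice, since the paper does not state which one is used, and the names `EXP`, `LOG`, `gf_add`, `gf_mul` and `gf_div` are hypothetical.

```python
W = 16
ORDER = (1 << W) - 1   # 65535 nonzero field elements
PRIM_POLY = 0x1100B    # x^16 + x^12 + x^3 + x + 1 (assumed primitive polynomial)

# Lookup tables, computed once: EXP[i] = x^i and LOG[x^i] = i.
EXP = [0] * (2 * ORDER)
LOG = [0] * (ORDER + 1)
value = 1
for i in range(ORDER):
    EXP[i] = value
    LOG[value] = i
    value <<= 1                 # multiply by the generator x
    if value >> W:              # degree 16 reached: reduce by the polynomial
        value ^= PRIM_POLY
for i in range(ORDER, 2 * ORDER):
    EXP[i] = EXP[i - ORDER]     # duplicated so that index sums need no modulo

def gf_add(a, b):
    """Addition and subtraction are both a bitwise XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiplication: add the logarithms."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    """Division: subtract the logarithms."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^16)")
    if a == 0:
        return 0
    return EXP[LOG[a] - LOG[b] + ORDER]
```

With these tables, a multiplication costs two table reads and one integer addition, which is why the approach suits data handled as a stream of 16-bit symbols.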

We measured the encoding time when using random codes
for various code rates, depending on the size of the file to be
encoded. Results are depicted in Figure 9. We show that for
a given (k, n) the encoding time is clearly linear with the file
size. For example with (k = 16, n = 32) the encoding times
for files of size 512MB and 1GB are respectively 143 and
272 seconds. In addition, the encoding time increases with
k and as the code rate decreases, since more encoded blocks
have to be created. For instance, a file of 1GB with k = 16 is
encoded in 272 seconds for a code rate 1/2 (n = 32), whereas
390 seconds are necessary for a code rate 1/3 (n = 48).

Transfer Time. 

We evaluated in this experiment the time to transfer the 
whole quantity of data needed to perform a complete repair 
for CNC, RS and replication, depending on the number of 
blocks to be repaired. In order to quantify the gains provided 
by CNC in isolation, we disabled the load balancing part of 
the protocol in this experiment. In other words, the same set 
of nodes is selected for all iterations of the repair process. 



The results are depicted in Figure 10.



¹ This library will be made public along with the paper.



Firstly, we observe on the figure that CNC consistently
outperforms the two alternative mechanisms. As CNC incurs
the transfer of a much smaller amount of data, the time
to transfer the blocks during the repair process is greatly
reduced compared to both RS and replication. For instance,
to download the necessary quantity of data to repair a node
which was hosting 10 blocks related to 10 different files,
CNC only requires 64 seconds whereas RS and replication
require respectively 95 and 154 seconds on average. It
should also be noted that no coding operations are done in
this experiment, except for CNC as nodes have to compute a
random linear combination of their encoded blocks to create
a RepairBlock before sending it. This time is taken into
account, thus explaining why the transfer time for CNC is not
exactly halved compared to RS.
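
The amount of data behind these transfer times can be estimated with a short calculation. The sketch below assumes that an encoded block (and a RepairBlock) is 1/k of the file size, that 1GB means 1024MB, and that an unpaired file is repaired with k blocks as in the classical case; the function name `blocks_to_transfer` is hypothetical.

```python
def blocks_to_transfer(files, k):
    """Blocks downloaded to repair one lost block for each of `files` files."""
    rs = files * k                        # RS: k blocks per lost block
    pairs, leftover = divmod(files, 2)
    cnc = pairs * (k + 1) + leftover * k  # CNC: k + 1 RepairBlocks per pair of files
    return rs, cnc

k, file_mb = 16, 1024
block_mb = file_mb // k                   # 64 MB per encoded block (assumption)
rs_blocks, cnc_blocks = blocks_to_transfer(10, k)
volumes_mb = {
    "RS": rs_blocks * block_mb,
    "CNC": cnc_blocks * block_mb,
    "replication": 10 * file_mb,          # one full replica per lost file
}
```

Under these assumptions, repairing 10 files moves 160 blocks with RS against 85 with CNC, a ratio of (k + 1) / 2k, which is slightly above one half.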

A second observation is that CNC also scales better with
the number of files to be repaired. As opposed to CNC, both
RS and replication involve transfer times for multiple files
which are strictly proportional to the time to transfer a
single file. For example, RS and CNC both require 9 seconds to
download a single file, but RS requires 95 seconds to download
10 files, while CNC only requires 64 seconds for the
same operation.

Finally, replication leads to the highest time to transfer.
This is mainly due to the fact that replication does not leverage
parallel downloads from several nodes, as opposed to
CNC and RS. Yet replication does not suffer from computational
costs, which can dramatically increase the whole repair
time of a failure as shown in the next section.

Repair Time. 

In this experiment, we measured the whole repair time of
a failure, depending on the number of blocks (related to
different files) the failed node was storing. The results, depicted
in Figure 11, include both the transfer times, evaluated in the
previous section, as well as coding times. They thereby represent
the effective time between the detection of a failure and
its full recovery. As replication does not incur any coding
operations, the time to repair is simply the
time to transfer the files. Note that for the sake of fairness,
we enable the load balancing mechanism both for CNC and
RS.

Figure 11 shows that the repair time is dramatically reduced
when using CNC compared to RS, especially with
an increasing number of files to be repaired. For instance,
to repair a node which was hosting 10 blocks related to 10
different files, CNC and replication require respectively 165
and 154 seconds while RS needs 1620 seconds on average.

These time savings are mainly due to the fact that decod- 
ing operations are avoided in CNC. In fact, the transfer time 
is almost negligible compared to the computational time for 
RS. The transfer time only represents 6% of the time to re- 
pair a node which was hosting 10 blocks related to 10 dif- 
ferent files with RS. This clearly emphasizes the interest of 
avoiding computationally intensive tasks such as decoding 
during the maintenance process. 

We can also observe that the time to repair a failure with
CNC is nearly equivalent to the one when using replication.
As shown in Figure 10, replication transfer times are much
higher than CNC ones, but this is counter-balanced by the
fact that some coding operations are necessary in CNC. In
other words, CNC saves time compared to replication during
the data transfer, but these savings are cancelled out by
linear combination computations. Finally our experiments
show that, as opposed to RS, CNC scales as well as replication
with the number of files to be repaired.

Load Balancing. 



As shown in Section 3.4, CNC provides a natural load balancing
feature. The random selection of nodes from which
to download blocks during the maintenance process ensures
that the load is evenly balanced between nodes. In this section,
we first experimentally verify that nodes are evenly
loaded, then we evaluate the impact of this load balancing
on the transfer time for both CNC and RS.



Figure 12 shows the number of blocks sent by each of the
32 nodes of the cluster for a repair of a node which was storing
100 blocks when using CNC. This involves 50 iterations
of the protocol, where at each iteration, k + 1 = 17 distinct
nodes send a RepairBlock. We observe that all nodes send a
similar number of blocks, i.e., nearly 26 in expectation. This
is consistent with the expected value analytically computed
according to Equation 1, as $50 \times \frac{17}{32} = 26.5625$.
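
The per-node load in this experiment can be reproduced with a small simulation. This is a sketch under the assumption that each iteration picks its k + 1 nodes uniformly at random among the 32 nodes; the function name `simulate_repair_load` and the seed are arbitrary.

```python
import random

def simulate_repair_load(n, k, rounds, rng):
    """Count the RepairBlocks sent by each node over `rounds` repair iterations."""
    load = [0] * n
    for _ in range(rounds):
        # k + 1 distinct nodes are selected uniformly at random for this iteration
        for node in rng.sample(range(n), k + 1):
            load[node] += 1
    return load

load = simulate_repair_load(n=32, k=16, rounds=50, rng=random.Random(42))
mean_load = sum(load) / len(load)   # 50 * 17 / 32 regardless of the random draws
```

The mean over the nodes is fixed by construction, since 50 x 17 RepairBlocks are sent in total; what the random selection adds is that individual node loads stay close to that mean.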

Figure 13 depicts the transfer time for both RS and CNC
depending on the number of files to be repaired. We compare
the transfer time between the load balanced approach
(CNC-LB and RS-LB), and its counterpart which involves a
fixed set of nodes, as done in Section 5.2. Results show that
transfer times are reduced when load balancing is enabled,
as the whole available bandwidth can be leveraged. In addition,
time savings due to load balancing increase as more
files have to be repaired.

6. RELATED WORK 

The problem of efficiently maintaining erasure-coded
content has triggered a novel research area both in theoretical
and practical communities. The design of novel codes tailored
for networked storage systems has emerged, with different
purposes.

For instance, in a context where partial recovery may be
tolerated, priority random linear codes have been proposed
in [22] to offer the property that critical data has a higher
opportunity to survive node failures than data of less importance.
Another point in the code design space is provided
by self-repairing codes [24], which have been especially designed
to minimize the number of nodes contacted during a
repair, thus enabling faster and parallel replenishment of lost
redundancy.

In a context where bandwidth is the scarcest resource, net- 
work coding has been shown to be a promising technique 
which can serve the maintenance process. Network coding 
was initially proposed to improve the throughput utilization 
of a given network topology E). Introduced in distributed 
storage systems in [7], it has been shown that the use of network
coding techniques can dramatically reduce the maintenance
bandwidth. The authors of [7] derived a class of codes,
namely regenerating codes, which achieve the optimal tradeoffs
between storage efficiency and repair bandwidth. In
spite of their attractive properties, regenerating codes are
mainly studied in an information theory context and lack
practical insights. Indeed, this seminal paper provides theoretical
bounds on the quantity of data to be transferred during
a repair, without supplying any explicit code constructions.
The computational cost of a random linear implementation 
of these codes can be found in p0| . A broad overview of the 
recent advances in this research area is given in [9].

Very recently, the authors of [20] and [25] have designed new
codes especially tailored for cloud systems. Paper [20] proposed
a new class of Reed-Solomon codes, namely rotated
Reed-Solomon codes, with the purpose of minimizing I/O for
recovery and degraded reads. Simple Regenerating Codes,
introduced in [25], trade storage efficiency to reduce the
maintenance bandwidth while providing exact repairs and a simple
XOR implementation.

Some other recent works [17, 18] aim to bring network
coding into practical systems. However they rely on code
designs which are not MDS, thus consuming more storage
space, or are only able to handle a single failure, hence limiting
their application context.

7. CONCLUSION 

While erasure codes, typically Reed-Solomon, have been
acknowledged as a sound alternative to plain replication in
the context of reliable distributed storage systems, they suffer
from high costs, in terms of both bandwidth and computation,
upon node repair. This is due to the fact that for each
lost block, it is necessary to download enough blocks of the
corresponding file and decode the entire file before repairing.

In this paper, we address this issue and provide a novel 
code-based system providing high reliability and efficient
maintenance for practical distributed storage systems. The
originality of our approach, CNC, stems from a clever 
cluster-based placement strategy, assigning a set of files to 
a specific cluster of nodes combined with the use of random 
codes and network coding at the granularity of several files. 
CNC leverages network coding and the co-location of blocks 
of several files to encode files together during the repair. This 
provides a significant decrease of the bandwidth required 
during repair, avoids file decoding and provides useful node 
reintegration. We provide a theoretical analysis of CNC. We 
also implemented CNC and deployed it on a public testbed. 
Our evaluation shows dramatic improvement of CNC with 
respect to bandwidth consumption and repair time over both 
plain replication and Reed-Solomon-based approaches.

APPENDIX

Proof of Proposition 1 

Lemma 1. A linear combination of independent random
variables chosen uniformly in a finite field $\mathbb{F}_q$ also follows a
uniform distribution over $\mathbb{F}_q$.

PROOF. Let $S_N$ be the random variable defined by the linear
combination of $N$ random variables $\{X_1, X_2, \dots, X_N\}$.
These $N$ random variables are independent and take their
values uniformly in the finite field $\mathbb{F}_q$.

$$S_N = \sum_{i=1}^{N} \alpha_i X_i \quad \text{with } \forall i,\ X_i \in \mathbb{F}_q \text{ and } \alpha_i \in \mathbb{F}_q^*$$

We show by induction that if $\forall i,\ \Pr(X_i = x_i) = \frac{1}{q}$ then
$\Pr(S_N = s_N) = \frac{1}{q}$.

The case $N = 1$ is trivial. Let us first show that the proposition
is true for $N = 2$.

$$S_2 = \alpha_1 X_1 + \alpha_2 X_2$$

$$\Pr(S_2 = s_2) = \Pr(\alpha_1 X_1 + \alpha_2 X_2 = s_2) = \sum_{x_1=0}^{q-1} \Pr(X_1 = x_1)\, \Pr\!\left(X_2 = \frac{s_2 - \alpha_1 x_1}{\alpha_2}\right) = \sum_{x_1=0}^{q-1} \frac{1}{q} \cdot \frac{1}{q} = \frac{1}{q}$$

The proposition is thus true for $N = 2$. We suppose that it
is true for $N$, and prove that it is true for $N + 1$.

$$S_{N+1} = S_N + \alpha_{N+1} X_{N+1}$$

$$\Pr(S_{N+1} = s_{N+1}) = \Pr(S_N + \alpha_{N+1} X_{N+1} = s_{N+1}) = \sum_{x_{N+1}=0}^{q-1} \Pr(X_{N+1} = x_{N+1}) \times \Pr(S_N = s_{N+1} - \alpha_{N+1} x_{N+1}) = \sum_{x_{N+1}=0}^{q-1} \frac{1}{q} \cdot \frac{1}{q} = \frac{1}{q}$$

□

Definition 1. A random vector $V$ in a vector space
$X = \text{span}\{X_1, X_2, \dots, X_k\}$ where $X_i \in \mathbb{F}_q^l$ is defined as

$$V = \sum_{i=1}^{k} \alpha_i X_i$$

where the $\alpha_i$ coefficients are chosen uniformly at random
in the field $\mathbb{F}_q$, i.e. $\Pr(\alpha_i = a) = \frac{1}{q},\ \forall a \in \mathbb{F}_q$.

Let $X$ be the vector space defined as
$\text{span}\{X_1, X_2, \dots, X_k\}$. Let $Y$ be the vector space defined
as $\text{span}\{Y_1, Y_2, \dots, Y_k\}$. No assumptions are made
on $X_i$ and $Y_i$ except that they are all in $\mathbb{F}_q^l$. In fact, as $X_i$
and $Y_i$ are file blocks, it is not possible to ensure linear
independence, for example.

Let $B_X^i$ be an encoded block of the file $F_X$ stored on node $i$.
$B_X^i$ is a random linear combination of the $\{X_1, X_2, \dots, X_k\}$,
thus $B_X^i \in \text{span}\{X_1, X_2, \dots, X_k\} = X$, which is a subspace
of $\mathbb{F}_q^l$ of dimension $\text{Dim}(X) \le k$.

Lemma 2. $\forall\, \text{Dim}(X)$, $B_X^i$ is a random vector in $X$.

PROOF. Let $\mathcal{B}$ be the largest family of linearly independent
vectors of $\{X_1, X_2, \dots, X_k\}$. Then

$$\forall l \mid X_l \notin \mathcal{B},\ \exists \{b_j^l\} \text{ such that } X_l = \sum_{j \mid X_j \in \mathcal{B}} b_j^l X_j$$

$$B_X^i = \sum_{j \mid X_j \in \mathcal{B}} \alpha_j^i X_j + \sum_{l \mid X_l \notin \mathcal{B}} \alpha_l^i X_l = \sum_{j \mid X_j \in \mathcal{B}} \Big( \alpha_j^i + \sum_{l \mid X_l \notin \mathcal{B}} \alpha_l^i b_j^l \Big) X_j$$

From Lemma 1, all the coefficients of the linear combination
are random over $\mathbb{F}_q$, thus $B_X^i$ is a random vector in
$\text{span}(\mathcal{B})$.

As $\text{span}(\mathcal{B}) = \text{span}\{X_1, X_2, \dots, X_k\} = X$,
$B_X^i$ is a random vector in $X$. □
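
Lemma 1 can be checked exhaustively on a small field. The sketch below assumes a prime q, so that the integers modulo q form the field $\mathbb{F}_q$, and enumerates every outcome of the $X_i$ for every choice of nonzero coefficients; the function name `distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def distribution(alphas, q):
    """Exact distribution of S = sum_i alphas[i] * X_i mod q for uniform independent X_i."""
    counts = Counter(sum(a * x for a, x in zip(alphas, xs)) % q
                     for xs in product(range(q), repeat=len(alphas)))
    return {s: Fraction(counts[s], q ** len(alphas)) for s in range(q)}

q = 5  # small prime, so that arithmetic mod q is a field (assumption)
uniform = {s: Fraction(1, q) for s in range(q)}

# Every combination of 1, 2 or 3 variables with nonzero coefficients is uniform.
all_uniform = all(distribution(alphas, q) == uniform
                  for n in (1, 2, 3)
                  for alphas in product(range(1, q), repeat=n))
```

The nonzero-coefficient condition matters: with all coefficients equal to zero the combination is constant, which is why the proof takes the $\alpha_i$ in $\mathbb{F}_q^*$.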

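Let $D^i$ be the random linear combination of the two blocks
stored by the node $i$, with $i \in [1, k + 1]$.

$$D^i = \delta_X^i B_X^i + \delta_Y^i B_Y^i = \delta_X^i \Big( \sum_{j=1}^{k} \alpha_j^i X_j \Big) + \delta_Y^i \Big( \sum_{j=1}^{k} \beta_j^i Y_j \Big)$$

By definition, $D^i \in \text{span}\{X_1, X_2, \dots, X_k, Y_1, \dots, Y_k\}$.

As the $\delta^i$ are chosen randomly in $\mathbb{F}_q$, then from Lemma 1
the component $D_X^i = \delta_X^i B_X^i$ of $D^i$ is a random vector in $X$.

Let us take a family $\{D^1, \dots, D^{k+1}\}$.
As $\text{Dim}(X) \le k$, there exists $\{a_1, \dots, a_{k+1}\} \neq 0$ such that

$$\sum_{i=1}^{k+1} a_i D_X^i = 0$$

Thus:

$$\sum_{i=1}^{k+1} a_i D^i = \sum_{i=1}^{k+1} a_i D_X^i + \sum_{i=1}^{k+1} a_i D_Y^i = \sum_{i=1}^{k+1} a_i D_Y^i$$

As the $\delta_Y^i$ are chosen independently of the $a_i$, the new vector
is a random vector in $Y$. The reasoning is identical to get the
new vector in $X$, thus completing the proof.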
Proof of Equation (1)

During the repair process, the load on each node can be
evaluated using a Balls-in-Bins model. Balls correspond to
the blocks to be downloaded while bins represent the nodes
which are storing the blocks. For each iteration of the repair
protocol, k different nodes are selected to send a repair
block. This corresponds to throwing k identical balls into n
bins, with the constraint that once a bin has received a ball,
it cannot receive another ball in this round. In other words,
exactly k different bins are chosen at each round.



Lemma 3. At each round $i$, the probability that a given
bin has received one ball is $\frac{k}{n}$.

PROOF. Let $A$ be the event "the bin contains one ball at
round $i$". Thus $\bar{A}$ corresponds to the event "the bin is empty
at round $i$". $\Pr(\bar{A})$ is computed as the number of ways to
place the $k$ balls inside the $n - 1$ remaining bins, over all the
possibilities to place the $k$ balls into the $n$ bins.

$$\Pr(A) = 1 - \Pr(\bar{A}) = 1 - \frac{\binom{n-1}{k}}{\binom{n}{k}} = 1 - \frac{(n-1)!}{k!\,(n-1-k)!} \cdot \frac{k!\,(n-k)!}{n!} = 1 - \frac{(n-k)!}{n\,(n-1-k)!} = 1 - \frac{n-k}{n} = \frac{k}{n}$$

□
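
Lemma 3 and the resulting expected load can be checked numerically. The sketch below enumerates all selections for a small case and uses the closed form of the proof for the evaluation setting, where 17 nodes are selected among 32 at each of 50 rounds; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def p_selected_enum(n, k):
    """Probability that bin 0 receives a ball, by enumerating all k-subsets of n bins."""
    subsets = list(combinations(range(n), k))
    return Fraction(sum(1 for s in subsets if 0 in s), len(subsets))

def p_selected_formula(n, k):
    """Closed form from the proof: 1 - C(n-1, k) / C(n, k)."""
    return 1 - Fraction(comb(n - 1, k), comb(n, k))

def expected_load(t, n, k):
    """Expected number of balls in a given bin after t independent rounds."""
    return t * p_selected_formula(n, k)

small_case = p_selected_enum(6, 3)            # equals 3/6
evaluation_case = expected_load(50, 32, 17)   # 50 rounds, 17 selected nodes, 32 nodes
```

Using exact fractions avoids any rounding question when comparing the enumeration with the closed form.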

Let $X$ be the number of balls in a given bin after $t$
rounds. As the selections at each round are independent, the
number of balls in a given bin follows a binomial law:
$X \sim B(t, p)$ with $p = \frac{k}{n}$ (see Lemma 3). The expected
value, denoted $E(X)$, of the binomial random variable $X$
with parameters $t$ and $p$ is: $E(X) = tp = \frac{tk}{n}$.